NBA players are among the highest-paid professional athletes, so it is interesting to examine which factors affect an NBA player's salary. The purpose of this project is to build linear regression models to predict NBA player salaries, identify which statistics are the best predictors of a player's salary, and perform some general analysis of NBA players.
The following data frame captures statistics recorded for NBA players during the 2017-2018 season. The data set was retrieved from Kaggle. Its attributes include each player's age, height, weight, and position, as well as games played, average minutes per game, average points per game, Player Efficiency Rating, and more.
In this part, the team will analyze some general variables in the data set, such as player salary, height, weight, and position, and explore some interesting facts using the data. The team will also examine how individual performance statistics relate to player salary.
library(plotly)
library(dplyr)    # needed for arrange()
library(prob)
library(tidyr)
library(RColorBrewer)
# import datasets
season_stat_2017 <- read.csv(file = 'https://raw.githubusercontent.com/yuxiao-wu/R-Projects/main/NBA%20player%20salary%20prediction/Seasons_Stats.csv')
player <- read.csv(file = 'https://raw.githubusercontent.com/yuxiao-wu/R-Projects/main/NBA%20player%20salary%20prediction/Players.csv')
season17_salary <- read.csv(file = 'https://raw.githubusercontent.com/yuxiao-wu/R-Projects/main/NBA%20player%20salary%20prediction/Season17_salary.csv')
# combine data sets and clean NA data
data <- merge(x = season_stat_2017, y = player, by = 'Player')
data <- merge(x = data, y = season17_salary, by = 'Player')
data <- na.omit(data)
data<- data[data$Pos != 'PF-C',]
data <- arrange(data, desc(Salary))
data1 <- data[,c('FT', 'FTA', 'G', 'GS', 'FG', 'FGA', 'ORB', 'TOV', 'AST', 'OWS', 'DWS', 'BPM', 'VORP', 'X2P', 'X3P', 'DRB', 'STL', 'WS','height', 'weight', 'Salary')]

Firstly, we might want to see who is the highest-paid NBA player. As shown in the bar chart below, where the x axis is the player name and the y axis is salary, Stephen Curry receives the highest salary at about 35 million dollars, followed by LeBron James, Paul Millsap, and others.
# -------------------------- data exploration -----------------------------
fig <- plot_ly(
x = data$Player[1:10],y = data$Salary[1:10],type = "bar", marker = list(color = 'rgba(55, 128, 191, 0.7)'))%>%
layout(title = 'Top 10 NBA players Salary 2017',
xaxis = list(title = "Player Name", categoryorder = "total descending"),
yaxis = list(title = "Salary"))
fig

How is the salary distributed? Among all 509 players in the data set, a histogram of salaries is shown below. We can see that most players earn less than 5 million dollars, while a few players have very high salaries of more than 20 million dollars.
# -------------------------- data exploration -----------------------------
p1 <- plot_ly(data, x = ~Salary, type = 'histogram', name = 'Salary',
marker = list(color = 'rgba(99, 231, 234, 0.8)'))%>%
layout(title = 'NBA Players Salary Season 2017-2018',
yaxis = list(title = 'Number of Player'),
showlegend = FALSE)
p1

We want to take a look at how a player's height and weight affect their salary. As shown in the scatter plots below, there is no obvious linear relationship between a player's weight or height and their salary. The correlations between height/weight and salary are also calculated below; both are very low, which means a player's height and weight are not significant predictors of salary.
# -------------------------- correlation -----------------------------
fig <- plot_ly(data = data, x = ~height, y = ~Salary, type = 'scatter',
name = 'Player Height')%>%
layout(xaxis = list(title = "height/cm"),
yaxis = list(title = "Salary"))
fig2 <- plot_ly(data = data, x = ~weight, y = ~Salary, type = 'scatter',
name = 'Player Weight')%>%
layout(xaxis = list(title = "weight/kg"),
yaxis = list(title = "Salary"))
subplot(fig, fig2)%>%
layout(title = list(text = 'NBA player height and weight vs Salary'),
yaxis = list(title = "Salary"))

cat('correlation between player height and salary = ', cor(data$height, data$Salary),'\n')
cat('correlation between player weight and salary = ', cor(data$weight, data$Salary),'\n')

## correlation between player height and salary =  0.1412969
## correlation between player weight and salary =  0.1902711
Now we are interested in which variable has the highest correlation with player salary. Below is a bar chart of the variables and their correlations with salary. From the plot we can see that WS (Win Shares) has the highest correlation with salary, followed by FG (Field Goals), and so on.
# -------------------------- correlation -----------------------------
correlations <- cor(data1[,1:20],data1$Salary)
variables <- rownames(correlations)
correlation <- correlations[1:20]
fig <- plot_ly(
x = variables,
y = correlation,
type = "bar", marker = list(color = 'rgba(219, 64, 82, 0.7)'))%>%
layout(title = 'Correlations between each variables vs Salary',
xaxis = list(title = "Variables", categoryorder = "total descending"),
yaxis = list(title = "Correlation"))
fig

Now we want to predict the player's salary using some attributes in the data set. Looking at the correlations, the attribute with the highest correlation with salary is Win Shares, at 0.737. So we decide to use WS to build a simple linear regression model to predict salary. In the graph below, Win Shares vs Salary is plotted as a scatter plot, and we can see a strong linear relationship between them.

cat('correlation between win shares and salary = ', cor(data$Salary, data$WS),'\n')

## correlation between win shares and salary =  0.7372644
m0 <- lm(data$Salary ~ data$WS)
fig3 <- plot_ly(data = data, x = ~WS, y = ~Salary, color = ~Team,
hoverinfo = "text",type = 'scatter',mode = 'markers',
text = ~paste("Player: ", Player,
"<br>Salary: ", format(Salary, big.mark = ","),"$",
"<br>Win share: ", WS,
"<br>Team: ", Team))%>%
add_markers(y = ~Salary) %>%
add_trace(x = ~WS, y = fitted(m0), mode = "lines",
color = I("red"), name = 'regression line') %>%
layout(
title = "Salary vs Win Shares",
xaxis = list(title = "Win Shares"),
yaxis = list(title = "Salary USD"),
showlegend = T
)
fig3

summary(m0)

##
## Call:
## lm(formula = data$Salary ~ data$WS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19827738 -2642903 -743554 2016098 20997339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1731787 295797 5.855 8.6e-09 ***
## data$WS 1914393 77911 24.572 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4942000 on 507 degrees of freedom
## Multiple R-squared: 0.5436, Adjusted R-squared: 0.5427
## F-statistic: 603.8 on 1 and 507 DF, p-value: < 2.2e-16
Using the summary function, as the result shows, the fitted linear equation can be written as y = 1731787 + 1914393x. The R-squared for this model is 0.5436.
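As a quick sanity check on the fitted equation, we can plug a value of WS into the coefficients reported above (a sketch; the helper name `predict_salary` is ours, not part of the analysis):

```r
# Coefficients reported by summary(m0) above
intercept <- 1731787
slope     <- 1914393

# Hypothetical helper: predicted salary for a given number of win shares
predict_salary <- function(ws) intercept + slope * ws

# A player with 10 win shares is predicted to earn 20,875,717 dollars
predict_salary(10)
```

The same prediction could be obtained with `predict(m0)` on new data; the hand computation just makes the equation concrete.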
Now we want to add more attributes to the model, such as total games played, games started, turnovers, field goals, 3-point field goals, etc.
# -------------------------- linear regression -----------------------------
m <- lm(Salary~ .,data = data1)
summary(m)

##
## Call:
## lm(formula = Salary ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15646743 -2268351 -453140 1918560 17796189
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8770100.6 6222226.6 -1.409 0.159329
## FT -91472.7 26639.7 -3.434 0.000646 ***
## FTA 49152.7 16176.9 3.038 0.002505 **
## G -98360.9 18412.1 -5.342 1.41e-07 ***
## GS 48966.3 12817.4 3.820 0.000150 ***
## FG -104294.7 48678.4 -2.143 0.032644 *
## FGA 46102.3 16235.1 2.840 0.004704 **
## ORB -36218.6 13493.1 -2.684 0.007517 **
## TOV 55663.2 26083.2 2.134 0.033334 *
## AST -15928.5 9031.7 -1.764 0.078420 .
## OWS 5991573.4 4299718.7 1.393 0.164108
## DWS 6131522.1 4263620.2 1.438 0.151045
## BPM -19043.3 103950.4 -0.183 0.854720
## VORP -950500.1 499337.1 -1.904 0.057559 .
## X2P 23227.5 17238.3 1.347 0.178463
## X3P NA NA NA NA
## DRB -648.4 4579.1 -0.142 0.887454
## STL -14572.3 14871.4 -0.980 0.327624
## WS -2747074.0 4270752.1 -0.643 0.520377
## height 34865.8 40436.9 0.862 0.388985
## weight 48255.1 31828.2 1.516 0.130136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4458000 on 489 degrees of freedom
## Multiple R-squared: 0.6417, Adjusted R-squared: 0.6278
## F-statistic: 46.1 on 19 and 489 DF, p-value: < 2.2e-16
As the result above shows, the multivariate regression model has an R-squared value of 0.6417 and an adjusted R-squared value of 0.628. We added a total of 20 attributes, but some of them, such as height and weight, are insignificant. Therefore, we decide to use forward and backward selection to choose the best variables.
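The note "(1 not defined because of singularities)" in the output above, with X3P reported as NA, arises because FG is the exact sum of X2P and X3P, so one of the three columns is linearly redundant. A minimal sketch with made-up data shows lm() handling an aliased term the same way:

```r
set.seed(42)
# Made-up predictors where x_total is exactly x2 + x3, mirroring FG = X2P + X3P
x2 <- rnorm(50)
x3 <- rnorm(50)
x_total <- x2 + x3
y <- 2 * x2 + 3 * x3 + rnorm(50)

# lm() fits the columns it can and reports NA for the redundant one
fit <- lm(y ~ x_total + x2 + x3)
coef(fit)    # the coefficient for the aliased term x3 is NA
alias(fit)   # displays the exact linear dependence
```

Dropping one of FG, X2P, X3P from data1 before fitting would avoid the singularity entirely.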
From the result below we can see that backward selection removed six of the 20 candidate variables and retained 14. The resulting R-squared is 0.6402 and the adjusted R-squared is 0.63.
# backward selection
m1 <- step(m, direction = 'backward', trace=0)
summary(m1)

##
## Call:
## lm(formula = Salary ~ FT + FTA + G + GS + FG + FGA + ORB + TOV +
## AST + OWS + DWS + VORP + X2P + weight, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15325911 -2275046 -464050 1923368 18051283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4006614 2089159 -1.918 0.055711 .
## FT -91208 26461 -3.447 0.000615 ***
## FTA 48449 16035 3.021 0.002647 **
## G -101395 17085 -5.935 5.55e-09 ***
## GS 47732 12617 3.783 0.000174 ***
## FG -107173 48078 -2.229 0.026254 *
## FGA 46390 16086 2.884 0.004099 **
## ORB -37650 12252 -3.073 0.002237 **
## TOV 58246 25868 2.252 0.024784 *
## AST -18120 8852 -2.047 0.041183 *
## OWS 3345975 822354 4.069 5.50e-05 ***
## DWS 3234749 523740 6.176 1.37e-09 ***
## VORP -1060731 447887 -2.368 0.018254 *
## X2P 24994 16938 1.476 0.140695
## weight 71207 21328 3.339 0.000905 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4445000 on 494 degrees of freedom
## Multiple R-squared: 0.6402, Adjusted R-squared: 0.63
## F-statistic: 62.79 on 14 and 494 DF, p-value: < 2.2e-16
# forward selection
minModel <- lm(Salary~ 1, data = data1)
m2 <- step(minModel, direction = 'forward', trace=0,
scope = (~FT+ FTA+ G+ GS+ FG+ FGA+ ORB+TOV+ AST+ OWS+ DWS+ BPM+ VORP+ X2P+ X3P+ DRB+ STL+ WS+ height+ weight))
summary(m2)

##
## Call:
## lm(formula = Salary ~ WS + GS + FGA + G + DWS + X3P + weight +
## VORP + ORB, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17429704 -2369006 -529227 2134028 18083316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3287377 2054792 -1.600 0.1103
## WS 1187328 238833 4.971 9.16e-07 ***
## GS 50610 12492 4.051 5.90e-05 ***
## FGA 5332 1354 3.938 9.40e-05 ***
## G -89975 15831 -5.684 2.25e-08 ***
## DWS 1824468 438514 4.161 3.74e-05 ***
## X3P 12080 7098 1.702 0.0894 .
## weight 62634 20852 3.004 0.0028 **
## VORP -690862 380433 -1.816 0.0700 .
## ORB -11169 7356 -1.518 0.1296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4482000 on 499 degrees of freedom
## Multiple R-squared: 0.6305, Adjusted R-squared: 0.6238
## F-statistic: 94.6 on 9 and 499 DF, p-value: < 2.2e-16
From the result above we can see that forward selection selected 9 variables, including win shares, games started, field goal attempts, etc.; the other 11 candidate variables were not selected. The resulting R-squared is 0.6305 and the adjusted R-squared is 0.6238.
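Note that step() chooses variables by AIC rather than by R-squared, which is why forward and backward selection can land on different models. A minimal sketch on simulated data (not the NBA data set) shows backward selection at work:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                        # x3 carries no signal
y  <- 5 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

full <- lm(y ~ x1 + x2 + x3)
back <- step(full, direction = 'backward', trace = 0)

# x1 and x2 carry strong signal and are retained;
# the pure-noise term x3 will typically be dropped by the AIC criterion
names(coef(back))
```

Because AIC penalizes each extra parameter, a variable survives only if it improves the fit by more than that penalty.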
We want to know whether salaries differ by position, so a box plot was created to show the salary distribution for each position. There are five positions in basketball: Center, Shooting Guard, Point Guard, Power Forward, and Small Forward. From the plot below, Centers seem to have the highest average salary, followed by Power Forwards. However, there are many outliers with very high salaries, and the average salaries of the five positions are close to each other. Therefore, a one-way ANOVA test was performed to formally test whether there is a difference between the positions.
# box plot
fig <- plot_ly(x = data[data$Pos == 'C',]$Salary, type = "box",name = 'Center')%>%
add_trace(x = data[data$Pos == 'PF',]$Salary, name = 'Power Forward')%>%
add_trace(x = data[data$Pos == 'PG',]$Salary,name = 'Point Guard')%>%
add_trace(x = data[data$Pos == 'SF',]$Salary,name = 'Small Forward')%>%
add_trace(x = data[data$Pos == 'SG',]$Salary,name = 'Shooting Guard')%>%
layout(title = "Salary Distribution by Positions",
xaxis = list(title = "Salary"))
fig

First, the hypotheses and alpha level were set up as below:
H0: µC = µPF = µPG = µSF = µSG (the mean salaries for the different positions are equal)
H1: at least one of the mean salaries differs from the others
α = 0.05
An F test was selected to perform the analysis because we have more than two groups to compare. Since there are k = 5 groups and n = 509 rows of data, df1 = k - 1 = 4 and df2 = n - k = 509 - 5 = 504.

cat("qf(0.95, df1 = 4, df2 = 504) =", round(qf(0.95, df1 = 4, df2 = 504), 2))

## qf(0.95, df1 = 4, df2 = 504) = 2.39

data$Pos = factor(data$Pos)
m <- aov(data$Salary ~ data$Pos, data=data)
summary(m)

## Df Sum Sq Mean Sq F value Pr(>F)
## data$Pos 4 5.630e+14 1.407e+14 2.671 0.0316 *
## Residuals 504 2.656e+16 5.270e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the result above, we reject H0 if F is greater than the critical value 2.39; otherwise we do not reject H0. Using the aov and summary functions, we get an F test statistic of 2.671, which is greater than 2.39. Therefore, we reject the null hypothesis: we have significant evidence at α = 0.05 that there is a difference in salary among the positions Center, Power Forward, Point Guard, Small Forward, and Shooting Guard.
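The ANOVA decision can also be reproduced directly from the F statistic in the table, using the critical-value and p-value functions for the F distribution:

```r
# F statistic and degrees of freedom from the ANOVA table
f_obs <- 2.671
df1 <- 4      # k - 1 groups
df2 <- 504    # n - k observations

f_crit <- qf(0.95, df1, df2)                       # critical value at alpha = 0.05
p_val  <- pf(f_obs, df1, df2, lower.tail = FALSE)  # reproduces Pr(>F), about 0.0316

f_obs > f_crit   # TRUE, so H0 is rejected
```

Comparing F against the critical value and comparing the p-value against α are equivalent decision rules.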
Knowing that the global F test is significant, we want to further determine which of the group means differ. Pairwise t comparisons were used to perform this analysis.
pairwise.t.test(data$Salary, data$Pos, p.adj="bonferroni")

##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$Salary and data$Pos
##
## C PF PG SF
## PF 0.943 - - -
## PG 0.033 1.000 - -
## SF 0.733 1.000 1.000 -
## SG 0.056 1.000 1.000 1.000
##
## P value adjustment method: bonferroni
From the pairwise comparison results above, we can see that all adjusted p-values are high except for two pairs: C - PG and C - SG. So an additional t test was performed to test whether the mean salaries for Center and Point Guard differ at the α = 0.05 level.
cat("qt(0.975, df = 504) =", qt(0.975, df = 504))

## qt(0.975, df = 504) = 1.964682
a <- data[data$Pos == 'C',]$Salary
b <- data[data$Pos == 'PG',]$Salary
cat('test statistic = ', (mean(a) - mean(b))/sqrt(var(data$Salary)*(1/length(a) + 1/length(b))))

## test statistic =  2.934203
Since t = 2.934 is greater than 1.96, we reject H0: we have significant evidence at α = 0.05 that the true mean salary differs between Point Guard and Center.
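As a cross-check, R's built-in t.test() performs an equivalent two-sample comparison. A sketch on made-up salary vectors (standing in for the Center and Point Guard groups, since the function only needs the two samples):

```r
set.seed(7)
# Made-up salary samples in dollars; the group sizes, means, and spreads
# are illustrative, not taken from the NBA data set
center_salary <- rnorm(80,  mean = 9e6, sd = 5e6)
pg_salary     <- rnorm(100, mean = 4e6, sd = 5e6)

# Welch two-sample t test (does not assume equal variances)
res <- t.test(center_salary, pg_salary)
res$p.value < 0.05   # reject H0 when the p-value falls below alpha
```

Unlike the hand computation above, which pools the overall salary variance, t.test() estimates each group's variance separately, which is the safer default when group spreads may differ.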
In conclusion, a multivariate linear regression model was built to predict NBA player salaries. Some significant predictors include Win Shares, Field Goals, Games Played, and Games Started. The salaries across positions were then compared: a global F test showed that at least one pair of position mean salaries differs, and a follow-up t test identified Center and Point Guard as a pair with different mean salaries.